Abstract

We present Mantis, a new framework that automatically predicts program performance with high accuracy. Mantis integrates techniques from programming languages and machine learning for performance modeling, and is a radical departure from traditional approaches. Mantis extracts program features, which are pieces of information about program execution runs, through program instrumentation. It uses machine learning techniques to select the features relevant to performance and creates prediction models as a function of the selected features. Through program analysis, it then generates compact code slices that compute these feature values for prediction. Our evaluation shows that Mantis can achieve more than 93% accuracy using less than 10% of the data set for training, which is a significant improvement over models that are oblivious to program features. The code slices that the system generates compute feature values cheaply.

1 Introduction 

Today's programs are numerous and are becoming increasingly complex. For example, services running in data centers are often large scale and perform complicated operations that depend on the input workload. Predicting how applications will behave for a given input workload is key to helping users and operators better manage those applications.

Predicting metrics (e.g., performance, resource consumption) has valuable applications in many usage scenarios. First, predicting the execution time of a service request can improve workload management [18]. If the request is likely to violate service level agreements, the system can drop the request and allocate resources to other requests. Second, in scheduling applications such as MapReduce [15, 23], if we can predict the execution time of tasks, we can schedule jobs closer to optimally by considering where to map individual tasks onto candidate resources, and perform speculative execution in a timely fashion without spawning unnecessary processes. Third, prediction can help with resource provisioning [11, 12, 35] (e.g., how many servers should I use to run this job? Should I add more servers?). Fourth, prediction lets us detect performance anomalies: if an operation takes much longer than its predicted time, we label it an anomaly for troubleshooting purposes. Finally, prediction can answer what-if questions, such as how system behavior changes when the input workload or the system configuration changes [14, 25, 33].

Despite all these opportunities and demands, prediction has not entered the mainstream. The reason is that current practices, which either model the system analytically or treat it as a black box and derive a transfer function between input workload and output response, have great difficulty predicting metrics with high accuracy. System execution inherently depends on program semantics (i.e., the internals of how the program works), and therefore so does prediction. For example, certain metadata of programs (e.g., image resolution and depth) acts as a cache of program semantics. One way to obtain program semantics is to ask the program's developers for the details. In reality, however, this is not feasible given the abundance and complexity of programs. In this work, we aim to extract program semantics automatically, without involving the program's developers, and to use them to create better prediction models for system performance (execution time). In particular, we focus on prediction across different input workloads in the same environment (i.e., machine).

Mantis is a system that achieves this goal by combining programming language and machine learning techniques in a novel way (Section 2). Mantis consists of three key components: feature instrumentation, model generation, and generation of code snippets that compute feature values at prediction time. To capture program semantics without programmer assistance, we begin by extracting a potentially large number of program features that capture the characteristics of program execution, by running programs instrumented through code analysis (Section 3). Next, we use machine learning techniques to select important, relevant features and create the prediction models (Section 4). Finally, program slicing produces small code snippets that compute the features needed by the model for prediction (Section 5). This component also guides model generation toward features that can be computed cheaply.

We evaluate our system by applying the framework to two applications: the Lucene search engine and ImageJ, an image processing application (Section 6). We show that Mantis can achieve more than 93% accuracy using less than 10% of the data set for training. We compare the models generated by Mantis with black-box approaches that rely on workload size and input configurations (e.g., command-line arguments) and show that Mantis can significantly outperform these models. We explain how slicing can benefit model generation and report the overhead of computing the selected features. Finally, in Section 7 we discuss related work, and we conclude with future research directions in the final section.

Figure 1: Mantis architecture. [Diagram: the feature instrumentor takes the program and the feature schemas and emits an instrumented program; profiling yields performance metrics and feature values for the prediction modeler; the modeler passes a function over the selected features to the program slicer; the slicer returns feature evaluation costs to the modeler and outputs the final feature evaluators together with a function over the final features.]

2 Mantis Overview 
2.1 Approach 

We address the problem of predicting performance metrics quickly, without actually running the entire program. Traditionally, researchers have fallen into two camps for prediction: modeling systems analytically (e.g., with queuing theory), or treating the system as a black box and building a model that maps input (workload) to output (performance). However, these approaches do not work well because of an inherent limitation: they lack knowledge about the program. We take a new white-box approach to generating prediction models. Unlike traditional approaches, we extract information from the execution of the program itself, which contains a plethora of information. In particular, we extract as many features as possible from the program for the given input data, as long as extracting them incurs little overhead, and rely on machine learning techniques to process the large amount of information produced. Machine learning techniques can infer the key features from this voluminous information and construct a robust model that predicts performance from the program features of new inputs. In summary, our approach addresses the prediction problem by combining programming language and machine learning techniques in a novel way.
To achieve our goal, we need to address three key ques- 
tions: 

1. What are good program features? How do we extract these feature values?

2. Among many features, which ones are relevant to performance metrics? How do we model performance with the relevant features?

3. How do we automatically generate code to compute feature values for prediction?

We present Mantis, a new prediction architecture that 
addresses the three questions above. There are three main 
components, each of which addresses a key question. 

2.2 Architecture 

Figure 1 shows the Mantis architecture, a novel prediction framework that combines programming language techniques with machine learning techniques. The figure shows the offline part, which generates prediction models.

Mantis consists of three major components: feature in- 
strumentation and profiling, prediction model generation, 
and feature evaluator generation. The feature instrumentor 
analyzes the code of the program and automatically adds 
instrumentation code that extracts program features. Then 
the profiler runs this instrumented program with sample 
input data to collect performance metrics and feature val- 
ues. This profiling can generate a large number of features 
within the budget of instrumentation overhead. Then, the 
model generator runs machine learning algorithms to gen- 
erate a prediction model, i.e., select a subset of key fea- 
tures that are relevant to the performance metrics and cre- 
ate a function of the selected features to predict the perfor- 
mance metrics with high accuracy. To use the model, we 
need a way to compute feature values. The feature evalua- 
tor generator uses program slicing to automatically extract 
small code snippets (which we call feature evaluators) that 
compute feature values from the instrumented program. 
Ideally, for a feature evaluator, the technique includes only 
program statements that affect the feature value of the 
evaluator. Finally, there is a feedback loop from the slicer to the model generator, because we may not be able to use some of the features selected by the model generator. If a selected feature is expensive to compute (e.g., we have to run the entire program to compute its value), the slicer rejects the feature by reporting the cost of computing it to the model generator. The prediction model generator then creates a new model that excludes the rejected feature(s). This loop may run multiple times. When the program slicer can cheaply generate feature evaluators for all the selected features, the process ends, and the tool outputs the final features, the prediction model, and the feature evaluators.
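
The feedback loop between the slicer and the model generator can be sketched as follows. This is only an illustration under stated assumptions: `selectFeatures` and `sliceCost` are hypothetical stand-ins for the model generator and the slicer, and the cost threshold `maxCost` is an invented parameter, not something the text specifies.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

final class FeedbackLoopSketch {
    /**
     * Re-runs model generation until every selected feature is cheap to evaluate.
     * selectFeatures: hypothetical model generator; maps the still-allowed
     *   features to the subset used by the model.
     * sliceCost: hypothetical slicer result; cost of a feature evaluator as a
     *   fraction of the full program's execution time.
     */
    static Set<String> run(Set<String> allFeatures,
                           Function<Set<String>, Set<String>> selectFeatures,
                           ToDoubleFunction<String> sliceCost,
                           double maxCost) {
        Set<String> allowed = new TreeSet<>(allFeatures);
        while (true) {
            Set<String> selected = selectFeatures.apply(allowed);
            Set<String> rejected = new TreeSet<>();
            for (String f : selected) {
                if (sliceCost.applyAsDouble(f) > maxCost) rejected.add(f);
            }
            if (rejected.isEmpty()) return selected; // all evaluators are cheap
            allowed.removeAll(rejected);             // exclude them and re-model
        }
    }

    /** Toy run: the "modeler" simply picks the first two allowed features. */
    static Set<String> demo() {
        Map<String, Double> cost = Map.of("loop1", 0.9, "branch3", 0.01, "var_n", 0.02);
        Function<Set<String>, Set<String>> modeler = allowed -> {
            Set<String> out = new TreeSet<>();
            for (String f : allowed) {
                if (out.size() < 2) out.add(f);
            }
            return out;
        };
        return run(cost.keySet(), modeler, cost::get, 0.1);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the toy run, `loop1` is selected first, rejected because its evaluator is too expensive, and replaced by `var_n` in the second round.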

Once a prediction model is generated, it is used to forecast a performance metric of interest for a new input, as shown in Figure 2. The example has k feature evaluators. The new input is sent to each feature evaluator to compute its feature value, and the prediction model computes an estimate from all the feature values.
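
The online prediction step can be sketched as below. The evaluators and the model coefficients are invented for illustration; in Mantis the evaluators would be generated slices and the model would come from the model generator.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

final class OnlinePredictionSketch {
    /** Runs each of the k feature evaluators on the input, then applies the model. */
    static <I> double predict(I input,
                              List<ToDoubleFunction<I>> featureEvaluators,
                              ToDoubleFunction<double[]> model) {
        double[] values = new double[featureEvaluators.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = featureEvaluators.get(i).applyAsDouble(input);
        }
        return model.applyAsDouble(values);
    }

    static double demo() {
        // Hypothetical evaluators: number of lines and number of words in the input.
        List<ToDoubleFunction<String>> evaluators = List.of(
            s -> s.split("\n").length,
            s -> s.trim().split("\\s+").length);
        // Hypothetical fitted model: 2 + 0.5 * lines + 0.25 * lines * words.
        ToDoubleFunction<double[]> model = v -> 2.0 + 0.5 * v[0] + 0.25 * v[0] * v[1];
        return predict("a b\nc d", evaluators, model);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```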

In the following, we explain these components in detail 
and evaluate the system. 

3 Feature Instrumentation 

We extract program features relevant to performance (e.g., execution time). We choose features with the following goals in mind. First, the features should capture the program behavior that determines performance. Second, the features should be accurate and easy to compute; for example, we avoid relying on timers with inaccurate resolution. Third, the features should be collected with low overhead. We aim to run the instrumented program once to collect feature values and performance metrics at the same time, instead of running the original program to get performance metrics, running the instrumented program to get feature values, and joining the data. The latter is not accurate when the program is non-deterministic.

In the following, we first describe what program fea- 
tures we instrument and then present how to create instru- 
mented programs. We use Java programs as examples, but 
our techniques are generally applicable to other program- 
ming languages. 

3.1 Features 

The features we choose are loop counts, method invocation counts, branch counts, variable values in different versions, and exception counts. We discuss the rationale for choosing these features below.

Loops When a program repeats computation, the execution time depends on how many times the program repeats it. We introduce loop counts to capture this behavior. We instrument all loop constructs (e.g., while and for) in the program. If there are nested loops, we add a loop count for each loop. The following example shows a nested loop: the outer loop reads a line from a file, and the inner loop performs a search operation n times.

    // original code
    while ((line = readLine()) != null) {
        for (int i = 0; i < n; ++i)
            search(line, i);
    }

    // instrumented code
    while ((line = readLine()) != null) {
        ++mantis_loop_cnt1;
        for (int i = 0; i < n; ++i) {
            ++mantis_loop_cnt2;
            search(line, i);
        }
    }

Method invocation Another way to repeat computation is to use a recursive procedure. The execution time depends on how many times the program invokes recursive methods. To capture this behavior, we introduce method invocation counts, each of which is incremented when a method is invoked. The following example shows a program that traverses a tree structure and computes an aggregated metric.

    // original code
    void process(Node n) {
        if (cond) return;
        process(n.l);
        process(n.r);
        compute(n);
    }

    // instrumented code
    void process(Node n) {
        ++mantis_methodinv_cnt;
        if (cond) return;
        process(n.l);
        process(n.r);
        compute(n);
    }

Branches Often the execution time changes depending on which control flow path the program takes. This can be captured by adding branch information. The following example shows that, depending on the conditional, the program takes one of two paths with very different execution times.

    // original code
    if (flag) { lightweightCompute(); }
    else { heavyCompute(); }

    // instrumented code
    if (flag) {
        ++mantis_branch_cnt1;
        lightweightCompute();
    } else {
        ++mantis_branch_cnt2;
        heavyCompute();
    }

We add a branch counter for each branch in the pro- 
gram. It counts how many times a particular branch is 
taken. For example, if the program takes a particular 
branch once depending on a conditional, the counter value 
is either 1 or 0. If a branch is taken multiple times, its 
value reflects that. 

Variable values We also instrument versions of variable values to characterize the program execution. We focus on primitive variables (short, int, long, float, double, char, and boolean variables) and collect the first k values assigned to each variable. Our intuition is that variable values obtained from input parameters and configurations often change infrequently, and these values tend to affect program execution by changing control flow. We track both class field variables and local variables.

In the following example, the execution time of the program is dominated by the input argument n of compute(), which comes from a preprocessing step that finishes quickly. We collect the assigned value of n as one of our program features.

    // original code
    n = preprocess();
    compute(n);

    // instrumented code
    n = preprocess();
    mantis_n_data[cur_ptr++] = n;
    compute(n);
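
Collecting only the first k values implies a bound on the recording array. A minimal sketch of such a bounded recorder is shown below; the class name and the choice of `long` as the stored type are assumptions made for illustration, and the instrumented code above would call `record` where it writes to mantis_n_data.

```java
import java.util.Arrays;

final class VariableVersionsSketch {
    private final long[] values; // holds at most the first k assigned values
    private int count;           // number of values recorded so far

    VariableVersionsSketch(int k) {
        values = new long[k];
    }

    /** Called after each assignment; assignments beyond the first k are ignored. */
    void record(long v) {
        if (count < values.length) {
            values[count++] = v;
        }
    }

    /** Returns the recorded versions in assignment order. */
    long[] snapshot() {
        return Arrays.copyOf(values, count);
    }

    public static void main(String[] args) {
        VariableVersionsSketch n = new VariableVersionsSketch(3);
        for (long v = 10; v <= 50; v += 10) {
            n.record(v);
        }
        System.out.println(Arrays.toString(n.snapshot()));
    }
}
```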

Exception counts For certain inputs, the program may take a control flow path that throws and handles errors. Because this is not the common path, the execution time of the program is likely to change significantly when it is taken. We add an exception count to each exception handling part of the program to capture this behavior.

    // original code
    try { compute(); }
    catch (Exception e) { error(e); }

    // instrumented code
    try { compute(); }
    catch (Exception e) {
        ++mantis_ex_cnt;
        error(e);
    }

To collect these features in multi-threaded object-oriented programs, we need to summarize the features across objects and across threads. We sum up loop counts and branch counts across objects, and keep a single array per variable for all objects created. To handle multiple threads, we maintain a separate instrumentation object that captures features per thread, and merge the feature values at the end of program execution. For loop and branch counts, we sum up the counts across threads. For versions of variables, we compute the mean of each version across threads.
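
The per-thread merge can be sketched as follows. The `PerThread` record layout is hypothetical, and the sketch assumes that the mean of version i is taken over only those threads that actually recorded a version i; the text does not say how threads with fewer versions are treated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ThreadMergeSketch {
    /** Hypothetical per-thread record: counters plus the versions of one variable. */
    static final class PerThread {
        final Map<String, Long> counts = new HashMap<>();
        final List<Double> versions = new ArrayList<>();
    }

    /** Sums each loop/branch counter across threads. */
    static Map<String, Long> mergeCounts(List<PerThread> threads) {
        Map<String, Long> total = new TreeMap<>();
        for (PerThread t : threads) {
            for (Map.Entry<String, Long> e : t.counts.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }

    /** Mean of version i over the threads that recorded it (NaN if none did). */
    static double[] mergeVersions(List<PerThread> threads, int k) {
        double[] sum = new double[k];
        int[] n = new int[k];
        for (PerThread t : threads) {
            for (int i = 0; i < Math.min(k, t.versions.size()); i++) {
                sum[i] += t.versions.get(i);
                n[i]++;
            }
        }
        double[] mean = new double[k];
        for (int i = 0; i < k; i++) {
            mean[i] = n[i] == 0 ? Double.NaN : sum[i] / n[i];
        }
        return mean;
    }

    public static void main(String[] args) {
        PerThread a = new PerThread();
        a.counts.put("loop1", 3L);
        a.versions.add(10.0);
        a.versions.add(20.0);
        PerThread b = new PerThread();
        b.counts.put("loop1", 4L);
        b.counts.put("branch1", 1L);
        b.versions.add(30.0);
        System.out.println(mergeCounts(List.of(a, b)));
        System.out.println(java.util.Arrays.toString(mergeVersions(List.of(a, b), 2)));
    }
}
```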

3.2 Instrumentor 

To instrument a program to obtain program features, we perform code analysis and transformation. In particular, we use source code analysis¹ to construct abstract syntax trees (ASTs), and manipulate the constructed ASTs by adding new nodes representing loop counts, method invocation counts, branch counts, exception counts, and variable versions to the trees.

We use the instrumented program to capture the exe- 
cution time of the original program as well as to cap- 
ture program feature values. To achieve low overhead, 
we employ three techniques. First, we perform selective 
profiling of programs. We focus on application programs 
and do not instrument system libraries (e.g., toString() or 
equals()). Second, we use a procedure that removes instru- 
mentation from the part that incurs high overhead until the 
overall instrumentation overhead is below our threshold 
(e.g., 5% of the original program execution time). Third, 
to avoid synchronization overhead of multiple threads ac- 
cessing the instrumentation variables, we use a thread- 
local data structure per thread, and merge data structures 
of the threads at the end of the program execution. 
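
The second technique can be sketched as a greedy procedure. The text does not specify the order in which instrumentation is removed, so this sketch assumes that the most expensive instrumentation point is dropped first and that per-point overheads are additive; the point names and numbers are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OverheadPruningSketch {
    /**
     * Drops the most expensive instrumentation points until the summed overhead
     * is at or below the threshold. The map gives, for each (hypothetical)
     * instrumentation point, its measured overhead as a fraction of the
     * original execution time.
     */
    static Set<String> prune(Map<String, Double> overhead, double threshold) {
        List<Map.Entry<String, Double>> byCost = new ArrayList<>(overhead.entrySet());
        byCost.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        Set<String> kept = new TreeSet<>(overhead.keySet());
        double total = 0;
        for (double v : overhead.values()) {
            total += v;
        }
        for (Map.Entry<String, Double> e : byCost) {
            if (total <= threshold) break;
            kept.remove(e.getKey()); // stop instrumenting this point
            total -= e.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> overhead =
            Map.of("innerLoop", 0.20, "outerLoop", 0.03, "branchA", 0.01);
        System.out.println(prune(overhead, 0.05));
    }
}
```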

After the instrumentation step, the profiler receives the instrumented program and test input data, runs the program on each input, and collects one tuple per input consisting of the program execution time and the feature values. The profiler then sends these (execution time, feature values) tuples to the prediction model generator, which performs feature selection and model creation.

4 Prediction Modeling 

We instrument and profile programs to collect many features from program execution runs for modeling the performance metrics of the programs. However, we expect that a small but relevant set of features can explain the execution time well, and hence we seek a compact model, i.e., a function of this small set of features, that accurately estimates the execution time of the program. Not all of the collected information is expected to be useful for the model: some features may have no variability across different inputs, some have very weak or even no correlation with the execution time, and others are redundant with each other. We do not know in advance which features are useful, so we would like to determine a small subset of features that is most relevant to predicting the execution time, and we are willing to sacrifice some of the small details in order to get the "big picture".

1 We can also implement our instrumentation using bytecode analysis. With bytecode, we do not have local variable names we can refer to unless the source code is compiled with the Java compiler's debug option.

To make the problem tractable, we constrain our mod- 
els to the multivariate polynomial family. We expect that a 
good program should have polynomial execution time on 
some (combination of) features, and a polynomial model 
up to certain degree can approximate well any nonlinear 
model (due to Taylor expansion). In addition, a compact 
polynomial model that predicts execution time well can 
provide an easy-to-understand explanation on what fac- 
tors are important in determining the execution time of 
the program, and then give program developers intuitive 
feedback on the performance of the program. 

In summary, we need a strategy that produces a (nonlinear) model on a small set of features out of the thousands collected blindly. We rely on machine learning techniques (specifically, sparse regression with a multivariate polynomial basis) to automatically infer this small subset of features and construct a compact model that captures the dominant predictors of execution time.

4.1 Background 

Least Square Regression Our feature instrumentation procedure outputs n data samples as tuples $\{y_i, \mathbf{x}_i\}_{i=1}^{n}$, where $y_i \in \mathbb{R}$ denotes the i-th observation of execution time, and $\mathbf{x}_i \in \mathbb{R}^m$ denotes the i-th observation of the vector of m features. We use regression techniques to model the relationship between y and x, which assume that the $y_i$'s are generated from $y = f(\mathbf{x}, \beta) + \epsilon$, where $\beta = [\beta_0, \beta_1, \beta_2, \ldots]$ is a vector of weights to be determined for the model, and $\epsilon$ is white noise. Least square regression is a mathematical procedure for finding the best-fitting $f(\mathbf{x}, \beta)$ to a given set of responses $y_i$ by minimizing the sum of the squares of the residuals [21], i.e.,

$$\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - f(\mathbf{x}_i, \beta)\bigr)^2. \qquad (1)$$

If a linear function $f(\mathbf{x}, \beta)$ is used, we obtain linear least square regression, which can be easily extended to create nonlinear models by using nonlinear (e.g., polynomial, spline, etc.) basis functions of the features x.
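
As a worked instance of the linear case, the least squares minimizer has a well-known closed form. The symbols $\Phi$ (design matrix), $\phi$ (vector of basis functions), and $\hat{\beta}$ are introduced here only for illustration and do not appear in the original text; the formula assumes $\Phi^\top \Phi$ is invertible.

```latex
\Phi = \begin{bmatrix} \phi(\mathbf{x}_1)^\top \\ \vdots \\ \phi(\mathbf{x}_n)^\top \end{bmatrix},
\qquad
f(\mathbf{x}, \beta) = \phi(\mathbf{x})^\top \beta
\;\Longrightarrow\;
\hat{\beta} = \arg\min_{\beta} \|\mathbf{y} - \Phi \beta\|_2^2
            = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}.
```

With $\phi(\mathbf{x}) = [1, x_1, \ldots, x_m]$ this is ordinary linear regression; replacing $\phi$ with polynomial terms of the features gives a model that is nonlinear in x but still linear in $\beta$.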

Sparse Regression While widely used, least square regression has two major drawbacks: 1) When a large number of features exist, least squares tends to create complex models and overfit the data, resulting in inferior prediction accuracy. 2) It is usually hard to interpret the results, because it tends to create models involving many feature terms, if not all of them. This is unsatisfactory for us, since we have many features but want only a small subset of them to contribute to the model.

Regression with best subset selection finds, for each $k \in \{1, 2, \ldots, m\}$, the subset of size k that gives the smallest residual sum of squares. However, it is a discrete optimization problem and is known to be NP-hard [21]. In recent years, a number of approaches based on model regularization have been proposed as efficient alternatives. Their main idea is to add a regularization term to problem (1) to control the complexity of the model, trading off the regression error against the number of features used in the model. Among them, a widely used one is LASSO (Least Absolute Shrinkage and Selection Operator) [34], which uses the quantity $\lambda \sum_{j=1}^{m} |\beta_j|$ to penalize problem (1). It effectively forces many $\beta_j$'s to be 0 and selects a small subset of features (indexed by the non-zero $\beta_j$'s) to build the model, which is usually compact and has better prediction accuracy than models created by ordinary least square regression [21]. The parameter $\lambda$ controls the complexity of the model: as $\lambda$ grows larger, fewer features are selected by the model.
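
To make the penalized problem concrete, the sketch below solves a linear LASSO by cyclic coordinate descent with soft-thresholding. This is one standard solver chosen here for brevity, not necessarily the algorithm used by Mantis; the class name, the fixed number of sweeps, and the unpenalized intercept are assumptions of the sketch.

```java
import java.util.Arrays;

final class LassoSketch {
    /** Soft-thresholding operator S(a, t) = sign(a) * max(|a| - t, 0). */
    static double softThreshold(double a, double t) {
        if (a > t) return a - t;
        if (a < -t) return a + t;
        return 0.0;
    }

    /**
     * Minimizes sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lambda * sum_j |b_j|
     * by cyclic coordinate descent. Returns [b0, b_1, ..., b_m].
     */
    static double[] fit(double[][] x, double[] y, double lambda, int sweeps) {
        int n = y.length, m = x[0].length;
        double[] beta = new double[m + 1];
        double[] resid = y.clone(); // y minus current prediction (prediction starts at 0)
        for (int s = 0; s < sweeps; s++) {
            // Unpenalized intercept: absorb the mean residual.
            double mean = 0;
            for (double r : resid) mean += r / n;
            beta[0] += mean;
            for (int i = 0; i < n; i++) resid[i] -= mean;
            for (int j = 0; j < m; j++) {
                double rho = 0, z = 0;
                for (int i = 0; i < n; i++) {
                    // Correlation of feature j with the residual that excludes feature j.
                    rho += x[i][j] * (resid[i] + beta[j + 1] * x[i][j]);
                    z += x[i][j] * x[i][j];
                }
                double updated = z == 0 ? 0 : softThreshold(rho, lambda / 2) / z;
                for (int i = 0; i < n; i++) resid[i] -= (updated - beta[j + 1]) * x[i][j];
                beta[j + 1] = updated;
            }
        }
        return beta;
    }

    public static void main(String[] args) {
        // y = 3 + 2 * x1; the second feature is irrelevant to y.
        double[][] x = {{1, 1}, {2, -2}, {3, 2}, {4, -2}, {5, 1}};
        double[] y = {5, 7, 9, 11, 13};
        System.out.println(Arrays.toString(fit(x, y, 1.0, 500)));
    }
}
```

In the toy data the second feature is uncorrelated with the response, so its weight is thresholded to exactly zero, while the weight of the first feature is shrunk slightly below its true value of 2 by the penalty.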

A great advantage of the LASSO method is that it is a convex optimization problem, and there exist fast algorithms to solve it efficiently even with large-scale data sets [17, 24]. LASSO also has nice theoretical and empirical properties; under suitable assumptions, it can recover the true underlying model [16, 34]. In addition, LASSO can be easily extended to create nonlinear models (e.g., using polynomial basis functions of the features).

4.2 Our Procedure 

We aim to use polynomial functions to model the execution time, so that we can clearly see which nonlinear terms on which features are important to the execution time. To capture nonlinear effects of, and interactions between, multiple features, we expand the features $\mathbf{x} = [x_1\ x_2\ \ldots\ x_k]$, $k < m$, to all the terms in the expansion of the degree-d polynomial $(1 + x_1 + \ldots + x_k)^d$, and use them to construct a multivariate polynomial function $f(\mathbf{x}, \beta)$ for the regression. For example, using a degree-2 polynomial with feature vector $\mathbf{x} = [x_1\ x_2]$, we expand $(1 + x_1 + x_2)^2$ to get the terms $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$, and use them as basis functions to construct the following function for regression:

$$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2.$$
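
The expansion into basis terms can be sketched in Java as follows. The class and method names are hypothetical, and the terms are produced in the order of a simple recursive enumeration, which differs from the order in which the terms are listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PolyExpansionSketch {
    /** All exponent vectors (e_1, ..., e_k) with e_1 + ... + e_k <= d. */
    static List<int[]> exponents(int k, int d) {
        List<int[]> out = new ArrayList<>();
        build(new int[k], 0, d, out);
        return out;
    }

    private static void build(int[] e, int pos, int remaining, List<int[]> out) {
        if (pos == e.length) {
            out.add(e.clone());
            return;
        }
        for (int p = 0; p <= remaining; p++) {
            e[pos] = p;
            build(e, pos + 1, remaining - p, out);
        }
        e[pos] = 0;
    }

    /** Evaluates every monomial x_1^{e_1} * ... * x_k^{e_k} of degree <= d at x. */
    static double[] expand(double[] x, int d) {
        List<int[]> exps = exponents(x.length, d);
        double[] terms = new double[exps.size()];
        for (int t = 0; t < terms.length; t++) {
            double v = 1.0;
            int[] e = exps.get(t);
            for (int j = 0; j < x.length; j++) {
                v *= Math.pow(x[j], e[j]);
            }
            terms[t] = v;
        }
        return terms;
    }

    public static void main(String[] args) {
        // Terms for x = [2, 3], d = 2, in the order 1, x2, x2^2, x1, x1*x2, x1^2.
        System.out.println(Arrays.toString(expand(new double[] {2, 3}, 2)));
    }
}
```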

Because we know neither which features are needed nor which nonlinear terms are necessary, an optimal but naive approach is to expand the degree-d multivariate polynomial over all m features and use all the terms to construct the regression function. However, this approach gives us $\binom{m+d}{d}$ terms, which is large when m is on the order of thousands, even for small d, and places a heavy burden on computing the regression model. Complete expansion on all features is not necessary, because many of them contribute little to the execution time, and many of them are redundant with each other.
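
To give a sense of scale, the following worked arithmetic evaluates the number of terms, $\binom{m+d}{d}$, for an illustrative m = 1000 features (a value chosen here as an example, not a figure reported in the text):

```latex
\binom{m+d}{d}\Big|_{m=1000,\,d=2} = \binom{1002}{2} = \frac{1002 \cdot 1001}{2} = 501{,}501,
\qquad
\binom{m+d}{d}\Big|_{m=1000,\,d=3} = \binom{1003}{3} = \frac{1003 \cdot 1002 \cdot 1001}{6} = 167{,}668{,}501.
```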

For efficient computation, we adopt a three-step approach to feature selection and nonlinear model fitting:

Step 1: Use the linear LASSO algorithm to filter out the (many) features that hardly contribute to the execution time. Although this step may be suboptimal (mainly due to the non-linearity in the true underlying model), it is cheap, fast, and scalable, and is provably better than the traditional feature selection methods that consider individual features one by one.

Step 2: Perform degree-d multivariate polynomial expansion on the features selected in Step 1, and use all the terms from the expansion as the basis functions for the nonlinear model.

Step 3: Use the LASSO method on the expanded features to pick out a subset of nonlinear terms to construct the model.

With these three steps, we have an efficient procedure for selecting a small set of nonlinear terms to construct a compact and intuitive model. Our experimental results in Section 6 show that our method can construct models that accurately predict execution time for a variety of applications.

5 Feature Evaluator Generation 

In this section we explain our feature evaluator generation component. To generate a feature evaluator, i.e., a small code snippet that computes the value of a feature, we use a program slicing algorithm. Given a program and a slicing criterion, which is a program variable v at a program point p, static slicing [36] computes a slice, which is an executable sub-program of the given program that yields the same value of v at p as the given program on all inputs. The goal of static slicing is to yield as small a sub-program as possible. Figure 3 shows a slicing example. The original code reads lines from a file, executes an expensive computation on each line, and accumulates the processed values. Suppose we want to extract the part of the code that affects the computation of the instrumented variable mantis_loop_cnt1. Ideally, the slicer should produce the sliced code shown in Figure 3, which captures only the code that really affects the variable.

    // original code
    int j = 0;
    while ((line = readLine()) != null) {
        ++mantis_loop_cnt1;
        j = j + expensive_processing(line);
    }

    // sliced code on the variable
    // mantis_loop_cnt1
    while ((line = readLine()) != null) {
        ++mantis_loop_cnt1;
    }

Figure 3: A slicing example.

At a high level, our slicer captures intra-procedural and inter-procedural data dependencies and control dependencies of a slicing criterion. The produced slices must be executable, since in our system the generated sub-program is executed online on the given input to obtain the value of the slicing criterion. This requirement clearly differs from most slicing research, which is motivated by debugging (e.g., [32]) and whose goal is to highlight as few statements as possible to help the programmer debug a particular problem. Such work therefore drops the constraint in the original slicing definition that the generated sub-program be executable. To achieve executability, we need to solve several engineering issues related to Java language features.

Our slicing algorithm operates on expressions e, which may be of one of four kinds: a local variable v, a static field (i.e., a global variable) g, an abstract instance field (h, f) denoting instance field f of any object allocated at site h, or an abstract array element h, denoting any element of any array object allocated at site h. Abstractions of instance fields and array elements are required because static analysis cannot refer to concrete object addresses. Our slicing algorithm does not depend on the choice of abstraction, however, and can easily be modified to use abstractions other than object allocation sites.²

The slicer takes as input the given program and the slic- 
ing criterion c = (e,p), which is an expression e whose 
value is desired at program point p, and produces as output 
a corresponding slice. In our setting, e is always a static field g instrumented by us (e.g., a loop counter), and p is always the exit of a method of the program (e.g., the program's main method).

Our slicing algorithm is based on two algorithms (one from Horwitz, Reps, and Sagiv [22] and one from Reps, Horwitz, Sagiv, and Rosay [28]). We summarize the four steps of the algorithm. First, for each method, we construct a Program Dependence Graph (PDG). Second, for the entire program, we construct a System Dependence Graph (SDG), which is a set of PDGs with additional edges that capture interprocedural dependencies. Third, we augment the SDG with summary edges by running the interprocedural data flow analysis algorithm in [27] to handle context sensitivity. Finally, we run a 2-pass reachability algorithm on the augmented SDG. We explain the individual steps in more detail below.

2 The choice of abstraction affects the precision and scalability of the 
algorithm, and we found object allocation sites to strike a good tradeoff. 
PDG and SDG Our slicing algorithm operates on Joeq [6] quad code, a register-based intermediate representation. The vertices of a PDG represent quad code instructions (e.g., statements and predicates). The edges of a PDG represent data flow and control dependencies. An SDG also includes inter-procedural dependencies. A method call creates a call vertex and a set of actual-in and actual-out vertices: each parameter of a method call creates an actual-in vertex, and a return value creates an actual-out vertex. A method entry creates an entry vertex and a set of formal-in and formal-out vertices, which correspond to the arguments and the return value, respectively. A call edge connects the call vertex of a call site to the entry vertex of the matching method. In addition, a linkage-entry edge is created from each actual-in vertex to the corresponding formal-in vertex, and a linkage-exit edge links a formal-out vertex to the corresponding actual-out vertex.

Augmented SDG An SDG does not capture the context sensitivity of method calls. Therefore, if one call site of a method is included in a slice, other call sites of the same method may be included even though they do not affect the slicing criterion. To remedy this problem, we build an augmented SDG by adding summary edges to the SDG. A summary edge connects an actual-in vertex to an actual-out vertex and summarizes the effect of the actual-in on the actual-out of a method call. To create summary edges, we use the backward RHS algorithm [27]. It propagates backwards from the formal-out vertices of a method along data and control dependencies to calculate the path edges of the method. A path edge is of the form $(p, e_1) \rightarrow (p_{\text{formal-out}}, e_2)$, meaning that $e_1$ at program point p affects $e_2$ at $p_{\text{formal-out}}$; it always ends at a formal-out of the method and is created within the method. Therefore, when there exists a path edge from a formal-in vertex to a formal-out vertex of a method, a summary edge is created connecting the corresponding actual-in vertex and actual-out vertex at each call site of the method.

2-pass algorithm To identify statements to include in a 
slice, we run a 2-pass reachability algorithm on an aug- 
mented SDG. The first pass starts from the program point 
of a given slicing criterion and goes backwards along all 
the edges in the augmented SDG but not along linkage- 
exit edges. As a result, when encountering a call site, the 
first pass does not go into the method body but uses sum- 
mary edges of the call site. The second pass starts from 
all actual-out vertices visited in the first pass and traverses 
backwards using all the edges but not using linkage-entry 
and call edges. This pass covers all methods that corre- 
spond to call sites identified in the first pass. In addition, 
it may find more call sites while traversing the body of 
the methods and use summary edges; this process adds 
additional actual-out vertices of the summary edges and
the pass goes into the methods associated with the
actual-outs.
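
The two passes described above can be sketched as two backward reachability runs over a labeled edge list. This is only an illustrative sketch, not the JChord/Datalog implementation: the edge-kind strings, the vertex names, and the function names `backward_reach` and `two_pass_slice` are assumptions made for the example. An edge `(src, dst, kind)` means that `src` affects `dst`.

```python
from collections import defaultdict


def backward_reach(preds, starts, skip_kinds):
    """Vertices reachable backwards from `starts`, ignoring edges in `skip_kinds`."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        v = stack.pop()
        for u, kind in preds[v]:
            if kind not in skip_kinds and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def two_pass_slice(edges, actual_outs, criterion):
    """Two-pass backward slice over an augmented SDG given as (src, dst, kind) edges."""
    preds = defaultdict(list)
    for src, dst, kind in edges:
        preds[dst].append((src, kind))
    # Pass 1: stay out of callee bodies by skipping linkage-exit edges;
    # summary edges carry the effect of each call instead.
    pass1 = backward_reach(preds, [criterion], {"linkage-exit"})
    # Pass 2: restart from the actual-out vertices seen in pass 1 and descend
    # into callees, but never climb back to callers.
    seeds = [v for v in pass1 if v in actual_outs]
    pass2 = backward_reach(preds, seeds, {"linkage-entry", "call"})
    return pass1 | pass2


# Hypothetical program: method f is called at two sites, and only the result
# of the first call reaches the slicing criterion.
edges = [
    ("x1", "a_in1", "data"), ("x2", "a_in2", "data"),
    ("a_in1", "a_out1", "summary"), ("a_in2", "a_out2", "summary"),
    ("a_in1", "f_in", "linkage-entry"), ("a_in2", "f_in", "linkage-entry"),
    ("f_in", "f_body", "data"), ("f_body", "f_out", "data"),
    ("f_out", "a_out1", "linkage-exit"), ("f_out", "a_out2", "linkage-exit"),
    ("a_out1", "crit", "data"),
]
slice_vertices = two_pass_slice(edges, {"a_out1", "a_out2"}, "crit")
```

In this example the second call site (`x2`, `a_in2`, `a_out2`) stays out of the slice even though it calls the same method, which is the context-sensitivity that summary edges are meant to provide.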

Slicing made practical A set of program statements 
identified by the described algorithm may not meet Java 
language requirements. This problem needs to be resolved 
to create executable slices. We list a few of the engi- 
neering issues we addressed for that. First, we need to 
handle accesses to static fields and heap locations (in- 
stance fields and array elements). Therefore, when build- 
ing an SDG, we identify all such accesses in a method 
and create formal-in vertices for those read and formal- 
out for those written along with corresponding actual-in 
and actual-out vertices. Second, there may be uninitial- 
ized parameters if they are not included in a slice. We 
opt to keep method signatures, hence we initialize them 
with default values. Third, some methods are not reachable
from a main method but are called directly by the JVM
(e.g., class initializers). The algorithm will not include
these methods in a slice, yet they may still affect the
slicing criterion. Therefore, we do not slice out such code.
Fourth, when a new object creation is in a slice, the
corresponding constructor invocation may not be. To address this,
we create a control dependency between object creations 
and corresponding constructor invocations to ensure that 
they are also in the slice. Fifth, a constructor of a class ex- 
cept the Object class must include a call to a constructor 
of its parent class. Hence we include such calls when they 
are missing in a slice. Sixth, the first parameter of an in- 
stance method call is a reference to the associated object. 
Therefore if such a call site is in a slice, the first parameter 
has to be in the slice too and we ensure this. 

Final step The steps described so far generate
a slice of Joeq quad code. To generate the final Java byte
code that we can execute, we translate the Joeq quad code to
Jasmin [4] assembly code and use the Jasmin assembler to
generate Java byte code. We take a simple approach that 
translates each quad instruction to a corresponding set of 
byte codes. During the process, since we do not have com- 
plete information on ordering between basic blocks, we 
add an explicit goto instruction at the end of each basic 
block. However this may lead to a cycle if a conditional 
branch in a loop is sliced out and replaced by goto. We 
ensure that no cycle is created by performing a DFS-like 
search and choosing a successor as a target of the goto 
instruction only if it can reach the exit of the method.
Another special case is the JSR instruction, which pushes the
address of the next opcode onto the operand stack
as its return address. However, the next instruction in the
slice may not be the same as the one in the original program. Hence we
add an extra goto with an appropriate target after the JSR 
operation. Our current translator is not optimized; we plan 
to optimize the use of stacks if needed in the future. 
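
One way to pick a safe goto target when a conditional branch has been sliced out is sketched below. It is a substitute technique, not the DFS-like search described above: it ranks successors by breadth-first distance to the method exit, so every chain of inserted gotos strictly approaches the exit and cannot form a cycle. The CFG encoding (a dict from block name to successor list) and the function names are assumptions made for illustration.

```python
from collections import deque


def distance_to_exit(succs, exit_block):
    """Breadth-first distance from every block to `exit_block` over reversed CFG edges."""
    preds = {}
    for block, targets in succs.items():
        for t in targets:
            preds.setdefault(t, []).append(block)
    dist = {exit_block: 0}
    queue = deque([exit_block])
    while queue:
        block = queue.popleft()
        for p in preds.get(block, []):
            if p not in dist:
                dist[p] = dist[block] + 1
                queue.append(p)
    return dist


def choose_goto_targets(succs, exit_block, sliced_branches):
    """For each block whose branch was sliced out, pick the successor closest to the exit."""
    dist = distance_to_exit(succs, exit_block)
    targets = {}
    for block in sliced_branches:
        candidates = [s for s in succs[block] if s in dist]
        if not candidates:
            raise ValueError(f"no successor of {block!r} reaches the exit")
        targets[block] = min(candidates, key=dist.__getitem__)
    return targets


# Hypothetical loop: the conditional branch in "head" was sliced out.
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
goto_targets = choose_goto_targets(cfg, "exit", ["head"])
```

Here jumping to `body` would loop forever once the loop condition is gone, so the sketch selects `exit` as the goto target of `head`.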

Discussion There are static and dynamic program slicing
algorithms. They have tradeoffs between input coverage
and slice compactness. Static slicing works for all inputs,
but it may produce a bigger slice than dynamic slicing.
Dynamic slicing includes only code that is actually
executed for given inputs, but it does not cover all inputs.
As a starting point, we chose static slicing as it guarantees 
to work for all inputs, but in the future we plan to explore 
dynamic slicing or hybrid slicing that combines static slic- 
ing and dynamic slicing if we need to improve slice com- 
pactness. In this paper, we tested our static slicing algo- 
rithms with simple programs, and we plan to evaluate the 
scalability of our algorithms with complicated programs. 

6 Evaluation 

We have implemented Mantis that works with Java pro- 
grams by extending existing machine learning and pro- 
gram analysis tools. We built the feature instrumentor atop 
Eclipse JDT AST libraries. JDT is a toolkit that provides 
APIs to access and manipulate Java source code. We add 
visitors to ASTs to add instrumentation code. The ba- 
sic instrumentation variables are thread local variables. 
The instrumentor also introduces static global variables 
that summarize feature values across threads and are used 
for slicing. We implemented our modeling procedure in 
Matlab. Finally, we extended JChord [5], a static and
dynamic Java program analysis tool, to implement our
program slicing algorithm in Java and Datalog. In the
current version, we have to patch models for native library
functions manually to track dependencies inside native
library functions. The JChord slicer produces Joeq quadcode
slices. To create final executable bytecode slices, we
implemented a translator from quadcode to bytecode using
Jasmin [4].

To evaluate our system, we choose two applications, 
Lucene Search [8] and ImageJ [3], that involve intensive
computation. We evaluate the prediction accuracy of our 
system in terms of prediction error (i.e., prediction accu- 
racy = 1 - prediction error) and compare it with blackbox 
approaches. Prediction error is computed using the equation:

$$\text{prediction error} = \frac{|\text{predicted time} - \text{actual time}|}{\text{actual time}}$$

We show the sensitivity to the size of training data and the
regularization parameter $\lambda$, and the results of prediction.
Traditional blackbox approaches fail to predict execution
time with low prediction error, but our system can
construct an online predictor that predicts execution time
accurately. Note that, in presenting the results of our
system, we use only features that can be computed cheaply:
we iterate over feature selection and program slicing to
reject expensive features.
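
The prediction-error metric defined above is simple to compute. The sketch below assumes execution times are given as positive numbers; the function names and the sample values are illustrative and not taken from our measurements.

```python
def prediction_error(predicted_time, actual_time):
    """Relative error |predicted - actual| / actual for one run."""
    return abs(predicted_time - actual_time) / actual_time


def mean_prediction_error(pairs):
    """Average relative error over (predicted, actual) pairs."""
    errors = [prediction_error(p, a) for p, a in pairs]
    return sum(errors) / len(errors)


# Made-up numbers: predicting 93 for a run that took 100 gives a 7% error,
# i.e., 93% prediction accuracy.
example_error = prediction_error(93.0, 100.0)
example_accuracy = 1.0 - example_error
```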

Figure 4: For Lucene, the blackbox approach fails to predict
execution time accurately.

Figure 5: Predicted time vs. execution time of the Lucene
search application. Using 10% of Lucene search data for
training, our system can predict execution time with less
than 7% prediction error.

6.1 Prediction Results for Lucene 

After profiling our Lucene search application with various 
text input queries over a corpus of the works of Shake- 
speare and the King James Bible, we obtain a dataset with 
3840 samples, each of which consists of 1 execution time, 
9 loop features, 29 branch features, and 90 variable fea- 
tures from 18 variables (we record 5 versions of values 
for each variable). We thus obtain a dataset with 1 column
of execution time and 126 columns of features, a subset of
which would hypothetically explain the execution time
well. In the prediction modeling process, we normalize
each column of values into the range [0, 1], and randomly
partition the data samples (rows) into a training set and a testing set.

Figure 6: Using Lucene search data, we show in (a) that our approach is insensitive to parameter $\lambda$: there is a range
of $\lambda \in (0, 0.07]$ that results in similar models for accurate prediction, and show in (b) that our approach is insensitive
to the size of training data: even using 5% or less of data for training, our system can create models achieving accurate
prediction for execution time.

We first evaluate a blackbox approach for predicting
execution time. We choose one that can construct compact
nonlinear models using the command line arguments as
features (instead of using the feature data from our
profiler). The arguments are raw, threads, totalqueries,
hitsperpage, and repeat, denoted by $x_1$, $x_2$, $x_3$, $x_4$,
and $x_5$, respectively. We build models using either
ordinary least-squares regression or LASSO with a function
using all terms in the expansion of
$(1 + x_1 + x_2 + x_3 + x_4 + x_5)^3$. However, in either case
we consistently see more than 38% prediction error, even
when we: 1) (randomly) sample different portions of data
for training, 2) vary the size of training data (for model
regression) from 10% to 40%, 3) use polynomial functions
with order higher than 3, and 4) use different subsets of
the 5 features. Figure 4 shows predicted execution time vs.
actual execution time. As the figure shows, this blackbox
approach derived from command line features fails to model
and predict execution time accurately.
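
To make "all terms in the expansion" concrete, the sketch below enumerates the monomials of degree at most 3 in five variables and evaluates them for one sample. It only builds the basis functions; the regression that fits their coefficients is not shown. The function names and the sample values are assumptions made for illustration.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_terms(names, degree):
    """All monomials of total degree <= `degree`, each as a tuple of variable names."""
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(names, d))
    return terms


def expand(sample, names, degree):
    """Evaluate every monomial for one sample (dict: name -> value)."""
    # The empty tuple is the constant term; math.prod of no factors is 1.
    return [prod(sample[n] for n in term) for term in polynomial_terms(names, degree)]


args = ["x1", "x2", "x3", "x4", "x5"]
cubic_terms = polynomial_terms(args, 3)  # 1 + 5 + 15 + 35 = 56 basis functions
small_example = expand({"a": 2, "b": 3}, ["a", "b"], 2)  # 1, a, b, a^2, ab, b^2
```

With five command line arguments the cubic expansion already yields 56 candidate terms, which is why a sparse method such as LASSO is a natural choice for keeping the model compact.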

From our detailed analysis, the two features that have the
largest correlation with the execution time are feature-1,
totalqueries, which has a fair amount of correlation
with execution time, and feature-2, threads; the
remaining features are poorly correlated with execution
time. Despite some correlation in feature-1, the model
derived from command line features is not enough for
predicting execution time accurately: for each value of the
predicted time (on the x-axis), there are dramatically
different corresponding actual execution times (on the
y-axis), whereas an ideal prediction is a 45-degree line
passing through the origin. This result indicates that some
other factors that are not captured by the features
contribute to the execution time. In contrast, as shown in
the following, our system can automatically select
higher-quality features from the program, and construct
nonlinear models to predict execution time accurately.

To evaluate our system, we start with its sensitivity to $\lambda$,
the parameter for trading off the prediction error with the
number of selected features. We use 10% of data for
training (both feature selection and model fitting), and trace a
variety of $\lambda$ values (which may result in different subsets of
selected features, thus different models) using an efficient
algorithm proposed in ifTTl . We show the result in Fig- 
ure [6] (a). To our surprise, we see that our method is able 



to select 2-4 features (out of 126 in total) that are enough
to build a nonlinear model to predict the execution time
within 7% error. As expected, starting from a very small
$\lambda$ value and increasing it, our method selects a decreasing
number of features (from 4 down to 1), and consequently
results in models with decreasing prediction power. We
clearly see that there is a range of $\lambda$ values (e.g., (0, 0.07])
that enable our method to select the right set of features
and build models for accurate prediction. Repeating the
experiment with different (random) training samples, and
with 20%, 30% and 40% of data for training, we see very
similar behavior. We conclude that our method is insensitive
to the parameter $\lambda$, and setting $\lambda < 0.07$ allows us to
select the right features, and construct compact and accurate
models for predicting execution time.
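
The trade-off controlled by $\lambda$ can be illustrated with a small LASSO solver. The sketch below uses plain cyclic coordinate descent with soft-thresholding, which is a stand-in for, not a reproduction of, the efficient path algorithm used in our experiments. It minimizes $(1/2n)\|y - Xw\|^2 + \lambda\|w\|_1$ without an intercept, and the toy data are made up for illustration.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso(X, y, lam, sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column can never be selected
            # Correlation of column j with the residual that leaves feature j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w


def selected(w):
    """Indices of features with non-zero weight."""
    return [j for j, c in enumerate(w) if c != 0.0]


# Toy data: the response depends strongly on feature 0 and weakly on feature 1.
X = [[0.0, 0.5], [0.25, 0.1], [0.5, 0.9], [0.75, 0.3], [1.0, 0.6]]
y = [2.0 * a + 0.3 * b for a, b in X]
features_by_lambda = {lam: selected(lasso(X, y, lam)) for lam in (0.001, 0.2, 1.0)}
```

On this toy data a small $\lambda$ keeps both features, a moderate one keeps only the dominant feature, and a large one selects nothing, mirroring the decreasing feature counts described above.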

Fixing $\lambda = 0.03$ (see footnote 3), we study the sensitivity of our
method to the size of training data, and plot the result in
Figure 6(b). We see that with different sizes of training
data, prediction errors of the constructed models are fairly
stable, and even using 5% or less training data, our method
is able to produce accurate models for predicting
execution time.

3 An optimal $\lambda$ can be determined by a cross-validation approach,
e.g., further partitioning the training data into two sets, one for feature
selection and model regression, and another for testing the model. An
optimal $\lambda$ is the one giving the smallest testing error (on the part of
training data selected for testing).

Figure 7: Predicted time vs. execution time of the ImageJ
application for a blackbox approach.

Figure 8: Predicted time vs. execution time of the ImageJ
application. Using 10% of ImageJ data for training, our
system can predict execution time with 5.5% prediction
error.

To reveal more details of the model, we use $\lambda = 0.03$
and 10% of data for training, to investigate which features
are selected and what kinds of models are constructed.
We find that our algorithm usually selects 3-4 features
(depending on which subset of data is sampled for
training) and constructs a model with a prediction error around
6.1%. Figure 5 shows predicted execution time vs. actual
execution time of our system. In one instance of modeling,
the following 4 features are selected: 1) loop feature $l_2$,
related to a while loop for reading keywords from the query
file, 2) variable feature $v_3$, related to totalQueries,
3) variable feature $v_5$, related to hitsPerPage, and 4)
variable feature $v_9$, related to how many query processors
to create per thread. Among them, features $l_2$ and $v_9$ have
the largest weights (indicating they are the most
important) and persistently appear when sampling different
portions of the data for training. With just these two features,
we do a LASSO sparse regression with all basis functions
of features in the expansion of $(1 + l_2 + v_9)^3$. Interestingly,
we are able to construct the following nonlinear model

$$f(l_2, v_9) = 0.1 + 0.52\,l_2 + 0.09\,v_9 - 0.69\,(\ldots) - 0.07\,(\ldots) + 1.16\,(\ldots) + 0.13\,(\ldots),$$

which can predict execution time with error 6.7% (indicating
that the remaining two selected features, $v_3$ and $v_5$,
contribute less than 1% of the prediction accuracy).

6.2 Prediction Results for ImageJ 

ImageJ [3] is a public domain Java image processing
and analysis program. It provides a variety of tools
for displaying, editing, analyzing, and processing
images in many formats. We test a dozen tools
of ImageJ, including Smooth, Find Edges, FFT,
Find Maxima, etc. We choose to profile and predict
the execution time of Find Maxima, because it exhibits
high variance in execution time when processing different
images (even with similar size), making it a challenging
task to model the execution time.

To profile the Find Maxima, we use 3045 images 
from popular vision corpus of Caltech 101 (T), Event 
Dataset and PASCAL challenge 2008 dataset H). The 
images vary a lot in size and resolution, and have content
in different scenes (e.g., in the office, on the street, in the
natural environment, etc.) and with different object
categories (e.g., plane, car, bird, building, etc.). After
profiling, we obtain a dataset with 3045 samples, each of which
consists of one execution time, 291 loop features, 2935
branch features, and 2290 variable features from 458
variables (we record five versions of values for each variable).
We thus obtain a dataset with one column of execution time
and 5516 columns of features. After removing constant
and redundant columns, we obtain 182 useful features, a
(small) subset of which would likely explain the execution
time well. In the experiments, we normalize each column
of values into the range [0, 1], and randomly partition the data
into a training set and a testing set.
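
The column clean-up and normalization step can be sketched as follows. The text does not spell out what counts as a redundant column, so the sketch assumes the simplest definition: a column is redundant if it equals an earlier column after min-max scaling. The function name and the sample rows are illustrative.

```python
def clean_and_normalize(rows):
    """Drop constant and duplicate feature columns, then scale the rest into [0, 1].

    Returns the kept original column indices and the normalized rows.
    """
    columns = list(zip(*rows))
    kept_indices, kept_columns, seen = [], [], set()
    for j, col in enumerate(columns):
        lo, hi = min(col), max(col)
        if lo == hi:
            continue  # constant column carries no information
        scaled = tuple((v - lo) / (hi - lo) for v in col)
        if scaled in seen:
            continue  # assumed redundancy: identical to an earlier column after scaling
        seen.add(scaled)
        kept_indices.append(j)
        kept_columns.append(scaled)
    normalized_rows = [list(r) for r in zip(*kept_columns)]
    return kept_indices, normalized_rows


# Made-up feature matrix: column 1 is constant and column 2 is a multiple of column 0.
raw = [[1, 5, 10, 4], [2, 5, 20, 1], [3, 5, 30, 9]]
kept, normalized = clean_and_normalize(raw)
```

A stricter notion of redundancy (for example, near-perfect correlation) would remove more columns; the exact-match rule here is only the easiest one to state.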

For a blackbox approach, many methods can be used.
We consider one with the execution time as a nonlinear
function of a simple input parameter, the image size. We
start with a degree-3 polynomial function of the image size
$x$, and obtain the following prediction model using 20%
of data for training:

$$f(x) = 0.1 + 2.18x - 8.77x^2 + 36.6x^3.$$

Predicting on the remaining 80% test data, we consistently
see more than 35% prediction error regardless of which
subset of data is sampled as a training set (which may result in
slightly different models). We see similar results when we
increase the degree of the polynomial and the percentage
of training data.

We plot the execution time against the predicted execution
time obtained from the model in Figure 7. We clearly
see that image size alone is not enough for predicting
execution time, whatever model is used: the image size is
poorly correlated with the execution time, and for each
value of the image size or the predicted execution time,
there are dramatically different corresponding actual
execution times, indicating that some other
factors contribute to the execution time.

Table 1: Iterative procedure of model selection considering
the cost of computing feature values. In the example,
loop feature $l_3$ is related to a for loop printing search
results, and loop feature $l_7$ is related to a while loop
counting how many times queries are executed. We explained
the other features in Section 6.1. Computing $l_3$ requires
computing most of the program (i.e., doing actual lookups
of indices), thus it is rejected in step 1. This iterative
procedure stops at step 2 since all selected features are
quickly computable.

In contrast, our system can automatically select
two high-quality features from thousands of automatically
instrumented features, and construct a model to predict
execution time accurately. Of the two selected features,
one is the variable feature related to the width of the region
of interest (denoted by $w$); the other is the height of the
image (denoted by $h$). Although both are highly correlated with
the execution time, neither a single feature (even with a nonlinear
model) nor a linear combination of the two selected
features can predict execution time very well. Instead, using
10% of data for training with $\lambda = 0.03$, we do a LASSO
sparse regression using all (nonlinear) terms in the
expansion of $(1 + w + h)^3$, and obtain the following model

$$f(w, h) = 0.1 + 0.08w + 0.07h + 0.33wh + 0.02h^2,$$

which can predict the execution time accurately (around
5.5% prediction error), as shown in Figure 8.
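
As a worked example, the fitted model can be evaluated directly. The sketch below assumes that $w$, $h$, and the returned time are all in the normalized [0, 1] scale used for training; the function name and the sample inputs are illustrative.

```python
def imagej_model(w, h):
    """Normalized execution-time estimate from normalized ROI width w and image height h."""
    return 0.1 + 0.08 * w + 0.07 * h + 0.33 * w * h + 0.02 * h ** 2


# At the largest normalized width and height, the interaction term w*h
# contributes 0.33 of the 0.60 estimate, more than all other terms combined.
largest = imagej_model(1.0, 1.0)
smallest = imagej_model(0.0, 0.0)
```

The dominance of the $wh$ term is consistent with the observation that neither feature alone, nor a linear combination of the two, predicts execution time well.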

We also study the sensitivity of our method to $\lambda$ using
10% of data for training (both feature selection and model
fitting), and show the result in Figure 9(a). Again, we see
that our method is insensitive to $\lambda$, and there is a wide
range of $\lambda$ values (e.g., (0, 0.11]) that allow us to select
the right features, and construct compact and accurate models
for predicting execution time.

Fixing $\lambda = 0.03$, we study the sensitivity of our method
to the size of training data, and plot the result in Figure 9(b).
Again, we see that with different sizes of training data,
prediction errors of the constructed models are fairly
stable, and even using 5% or less training data, our method is
able to produce accurate models for predicting execution
time.

6.3 Benefit of Slicing 

In this section, we evaluate the benefit of slicing. We ex- 
plain how slicing can help choose features that are not 
expensive to compute and evaluate the execution times 
of feature evaluators of selected features computed manu- 
ally. 

Figure 10: CDF of execution time of the entire Lucene
search program and time to execute the slice that computes
loop feature $l_2$.

We look at the details of how the final features we presented
in Section 6.1 were chosen for the Lucene search application.
Table 1 shows the steps taken by the model generator due
to the feedback from the slicer. In each step, we show the
features selected by the model generator, and the features
rejected by the slicer among the features passed from the
model generator. For this application, at step 2, the slicer
can compute slices that quickly compute all the
features needed by the model; thus it accepts the selected
features and the feedback loop from the slicer to the model
generator ends.
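
The feedback loop between the model generator and the slicer can be sketched as below. The callbacks `select` and `cost`, the cost budget, and the feature names are all hypothetical stand-ins: `select` plays the role of the model generator, and `cost` plays the role of measuring how expensive a feature's slice is to run.

```python
def select_cheap_features(candidates, select, cost, budget):
    """Repeat feature selection until every chosen feature is cheap to compute.

    Returns the accepted features and the set of features rejected along the way.
    """
    rejected = set()
    while True:
        allowed = [f for f in candidates if f not in rejected]
        chosen = select(allowed)
        expensive = {f for f in chosen if cost(f) > budget}
        if not expensive:
            return chosen, rejected
        # Each round rejects at least one feature, so the loop terminates.
        rejected |= expensive


# Hypothetical setup: the model generator keeps the two best-ranked features,
# and the slice for "slow" costs almost as much as running the whole program.
ranking = ["slow", "fast1", "fast2", "fast3"]
slice_cost = {"slow": 0.9, "fast1": 0.03, "fast2": 0.03, "fast3": 0.03}
accepted, thrown_out = select_cheap_features(
    ranking,
    select=lambda allowed: allowed[:2],
    cost=slice_cost.__getitem__,
    budget=0.1,
)
```

In this toy run the first round rejects the expensive feature and the second round accepts the remaining choice, the same two-step shape as the procedure summarized in Table 1.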

Figure 10 shows the cumulative distribution function
(CDF) of execution time of the entire Lucene search
program and that of the slice that computes loop feature $l_2$. We
show only $l_2$ because it is the most expensive selected
feature to compute, since the slice goes through files to count
keywords. The other variable features are derived by
arithmetic operations and assignments of values from inputs.
Computing $l_2$ takes 3-4% of the entire program execution
time, thus the prediction model can compute an estimate
of execution time with low overhead.

7 Related Work 

Prediction has been explored in multiple different con- 
texts — database, cluster and cloud, networking, and pro- 
gram complexity modeling. In this paper, we presented a 
new performance prediction framework for generic pro- 
grams by combining programming language and machine 
learning techniques. As far as we know, our work is the 
first to explore program analysis to extract features, em- 
ploy machine learning to create accurate models with se- 
lected features, and use program slicing to automatically 
produce code snippets that compute feature values for pre- 
diction. 

In the database context, researchers have explored machine
learning algorithms to predict database query execution time.
Gupta, Mehta, and Dayal [20] used a variant of decision
trees to predict time ranges of data warehouse queries.
Ganapathi et al. [18] used KCCA to predict time and
resource consumption of database queries (number of I/Os
and messages) using the statistics of query texts and query
plans (e.g., instance count for each possible database
operator).

Figure 9: Using ImageJ data, we show in (a) that our approach is insensitive to $\lambda$: there is a range of $\lambda \in (0, 0.11]$ that
results in similar models for accurate prediction, and show in (b) that our approach is insensitive to the size of training
data: even using 5% or less of data for training, our system can create models achieving accurate prediction.

In resource allocation and provisioning for cluster and 
cloud applications, research has been done to forecast how 
much resource is required to support given workload to 
meet service level agreements lfTTlfT2l . or how long it 
takes to complete a candidate assignment [31]. The models
used resource consumption or workload size for prediction.
Xu et al. used console logs (coarse-grained program
status reports) to detect anomalies in the Hadoop
Distributed File System and the Darkstar game [37].

For load shedding in network monitoring applications, 
Barlet-Ros et al. [10] used a simple linear regression
model of features from five packet header tuples, num- 
ber of bytes, number of packets to predict CPU resource 
usage. This work is specific to packet processing applica- 
tions. In contrast, our framework is applicable to generic 
programs. 

In the networking context, multiple projects addressed
the problems of predicting response time changes for
what-if scenarios. WebProphet [25] predicts the impact
of certain optimizations of web services before deploying
them by extracting web object dependencies with injected
delays and simulating web page loading processes
with web object dependency graphs. WISE [33] predicts
the effects of configuration or deployment changes in
content distribution networks by modeling the network
dependency structure to response-time distribution. Link
Gradients [14] predicts the impact of network latency in
multi-tier systems by doing delay injection and performing
spectral analysis.

There have been studies on using information from
execution traces for modeling computational complexity [19],
simulating hardware platforms efficiently [29, 30],
and finding bugs cooperatively [26]. In contrast to
these, Mantis focuses on creating a model for predicting
program execution time by computing feature values
online with slices quickly for new inputs.



Trendprof [19] models asymptotic computational complexity
by measuring empirical computational complexity.
It computes a model that estimates the performance of
a program by modeling basic block execution frequency
in terms of user-specified features (e.g., input size) and
summarizing the program with clusters of basic blocks.

SimPoint [29, 30] finds a subset of the execution instruction
traces of a program for an input for efficient hardware
platform simulation, because simulating hardware for the
entire program execution takes too long. It instruments
basic block vectors in each fixed interval and uses
a clustering algorithm to extract a representative subset of
traces from clusters that approximates low-level hardware
metrics such as instructions per cycle, percent RUU
occupancy, cache miss rate, branch prediction miss rate, and
address prediction miss rate [30].

Cooperative bug isolation (CBI) [26] used three kinds of
predicates: branches with four values (always true, always
false, sometimes true and sometimes false, unreachable in
the run), comparisons between all pairs of integer-valued
variables and constants in the program, and comparisons
between integer-valued return results of functions and 0.
CBI aims to find which predicates are correlated with
crashes to find bugs, and uses sampling of predicates to
lower runtime overhead since the executed runs are
collected from end users of the program.

8 Conclusion 

In this paper, we presented Mantis, a new prediction 
framework that extracts program features using code anal- 
ysis, models performance with these features using sparse 
regression, and generates code snippets that compute the 
feature values. We take a first step towards building such 
a framework. Our prototype evaluation shows that Mantis
can predict execution time with more than 93% accuracy
for the applications we tested, a search engine and an
image processing application, which cannot be achieved
with models without program features. In the future, we
plan to evaluate our system with various complicated
applications in terms of accuracy, applicability, and
scalability.

Our new approach to prediction presents several excit- 
ing research directions we want to explore. First, we want 
to extend our model to include environment and to ex- 
plore more sophisticated features (e.g., feature values that 
depend on calling contexts) and more sophisticated slic- 
ing algorithms (e.g., algorithms based on dynamic control 
flow graphs). Second, we would like to build our frame- 
work for C/C++ languages with LLVM [7] since the cur- 
rent prototype works with Java programs. Third, we want 
to further extend our framework to apply to networked 
systems running on multiple nodes. Our work in this paper 
addressed single-machine program execution. Finally, we 
also would like to apply the tool to performance debug- 
ging (e.g., a tool that generates test cases for performance 
debugging similar to KLEE fPUl that generates test cases 
for correctness). 